Welcome to a workshop on analysing spatial data! The downstream analyses used in this workshop is in theory applicable to any format of data where we have a matrix of genes by cells, along with their x-y coordinates, whether that’s from technologies such as PhenoCycler, IMC, Xenium, or MERRFISH.
In this workshop, we’ll be re-analysing a head and neck squamous cell carcinoma dataset provided by Angela Ferguson, and published in Clinical Cancer Research.
For some context, head and neck cutaneous squamous cell carcinomas (HNcSCC) are the second most common type of skin cancer. The majority of HNcSCC can be treated with surgery and good local control, but a subset of large tumours infiltrate subcutaneous tissue and are considered high risk for local recurrence and metastases. The key conclusion of this manuscript (amongst others) is that spatial information about cells and the immune environment can be used to predict primary tumour progression or metastases in patients. We will use our spicy workflow to reach a similar conclusion.
We will aim to cover:
Introduction to imaging data and how to store it in R
How to segment cells from your data in R
How to annotate the cells you have segmented out
What kind of insights can I gain from my spatial data?
This workshop is suited for a beginner audience, however it is assumed that you have an understanding of the basic science that underpins spatial transcriptomics/proteomics data (i.e I know what a gene/protein is, and why disease might cause a change in gene/protein expression).
This workshop will take you all the way through from images in the form of a TIFF file, all the way to getting nice plots and visualisations from this data.
Bioinformatics packages are typically stored on bioconductor. When
installing an R package from bioconductor, we need to first install
BiocManager (a library that allows us to access
bioconductor within R)
# To install biocmanager
install.packages("BiocManager")
I’ve also provided some code for installing all the packages that we will be using in this workshop. This will take some time to install, and you may encounter errors when you try to run this code, depending on what packages you may already have installed.
BiocManager::install(c("cytomapper", "scater", "tidySingleCellExperiment", "SpatialExperiment", "simpleSeg", "FuseSOM", "spicyR", "lisaClust"))
As a brief introduction to the packages, the first packages we will
want to install are cytomapper, scater,
tidySingleCellExperiment and
SpatialExperiment. These are a couple prerequisite packages
not developed by our group, but are super useful in helping us visualise
and load in our imaging data. dplyr and
ggplot2 are standard R packages for dealing with data.
library(cytomapper)
library(scater)
library(tidySingleCellExperiment)
library(SpatialExperiment)
library(dplyr)
library(ggplot2)
theme_set(theme_classic())
In this workshop, we’ll be introducing 4 packages from spicyWorkflow. These are:
simpleSeg: For segmenting and normalising your cells
FuseSOM: For clustering (and self-annotating) your cells
spicyR: For identifying differential localisation of cell types between conditions.
lisaClust: For identifying consistent spatial organisation (regions) of cell types.
ClassifyR: For identifying consistent spatial organisation (regions) of cell types.
library(simpleSeg)
library(FuseSOM)
library(spicyR)
library(lisaClust)
library(ClassifyR)
nCores = 40
BPPARAM <- simpleSeg:::generateBPParam(cores = nCores)
First, let’s talk about files. When we’re working with IMC images, we
normally have each image as a folder containing a number of TIFF images,
each one being an image for a specific marker/channel. We’ll want to
specify the directory that contains all of these folders, and then feed
that into the loadImages() function.
pathToImages <- "/dskh/nobackup/alexq/spicyWorkflow/inst/extdata/Ferguson_Images"
# Store images in a CytoImageList on_disk as h5 files to save memory.
images <- cytomapper::loadImages(
pathToImages,
single_channel = TRUE,
on_disk = TRUE,
h5FilesPath = HDF5Array::getHDF5DumpDir(),
BPPARAM = BPPARAM
)
mcols(images) <- S4Vectors::DataFrame(imageID = names(images))
As we’re reading the image channels directly from the names of the TIFF image, often these channel names will need to be cleaned for ease of downstream processing.
The channel names can be accessed from the CytoImageList
object using the channelNames() function.
cn <- channelNames(images) # Read in channel names
head(cn)
## [1] "139La_139La_panCK.ome" "141Pr_141Pr_CD20.ome"
## [3] "142Nd_142Nd_HH3.ome" "143Nd_143Nd_CD45RA.ome"
## [5] "146Nd_146Nd_CD8a.ome" "147Sm_147Sm_podoplanin.ome"
cn <- sub(".*_", "", cn) # Remove preceding letters
cn <- sub(".ome", "", cn) # Remove the .ome
head(cn)
## [1] "panCK" "CD20" "HH3" "CD45RA" "CD8a"
## [6] "podoplanin"
channelNames(images) <- cn # Reassign channel names
Similarly, the image names will be taken from the folder name containing the individual TIFF images for each channel. These will often also need to be cleaned.
head(names(images))
## [1] "ROI001_ROI 01_F3_SP16-001550_1E" "ROI002_ROI 02_F4_SP16-001550_1E"
## [3] "ROI003_ROI 03_F5_SP16-001550_1E" "ROI005_ROI 05_G4_SP17-002069_1F"
## [5] "ROI006_ROI 06_G5_SP17-002069_1F" "ROI007_ROI 07_G6_SP17-005715_1B"
nam <- sapply(strsplit(names(images), "_"), `[`, 3)
head(nam)
## [1] "F3" "F4" "F5" "G4" "G5" "G6"
names(images) <- nam # Reassigning image names
mcols(images)[["imageID"]] <- nam # Reassigning image names
Our simpleSeg R package on https://github.com/SydneyBioX/simpleSeg provides a series of functions to generate simple segmentation masks of images. These functions leverage the functionality of the EBImage package on Bioconductor. For more flexibility when performing your segmentation in R we recommend learning to use the EBimage package. A key strength of the simpleSeg package is that we have coded multiple ways to perform some simple segmentation operations as well as incorporating multiple automatic procedures to optimise some key parameters when these aren’t specified.
If your images are stored in a list or
CytoImageList they can be segmented with a simple call to
simpleSeg(). To summarise, simpleSeg() is an R
implementation of a simple segmentation technique which traces out the
nuclei using a specified channel using nucleus then dilates
around the traced nuclei by a specified amount using
discSize. The nucleus can be traced out using either one
specified channel, or by using the principal components of all channels
most correlated to the specified nuclear channel by setting
pca = TRUE.
In the particular example below, we have asked simpleSeg
to do the following:
nucleus = c("HH3"), we’ve asked simpleSeg to
trace out the nuclei signal in the images using the HH3 channel.pca = TRUE, simpleSeg segments out the
nuclei mask using a principal component analysis of all channels and
using the principal components most aligned with the nuclei channel, in
this case, HH3.cellBody = "dilate", simpleSeg uses a
dilation strategy of segmentation, expanding out from the nucleus by a
specified discSize.discSize = 3, simpleSeg dilates out from the
nucleus by 3 pixels.sizeSelection = 20, simpleSeg ensures that
only cells with a size greater than 20 pixels will be used.transform = "sqrt", simpleSeg square root
transforms each of the channels prior to segmentation.tissue = c("panCK", "CD45", "HH3"), we
specify a tissue mask which simpleSeg uses, filtering out all background
noise outside the tissue mask. This is important as these are tumour
cores, wand hence circular, so we’d want to ignore background noise
which happens outside of the tumour core.There are many other parameters that can be specified in simpleSeg
(smooth, watershed, tolerance,
and ext), and we encourage the user to select the best
parameters which suit their biological context.
masks <- simpleSeg(images,
nucleus = c("HH3"),
pca = TRUE,
cellBody = "dilate",
discSize = 3,
sizeSelection = 20,
transform = "sqrt",
tissue = c("panCK", "CD45", "HH3"),
cores = nCores
)
Try segmenting yourself! I’ve provided a zip folder containing 1 image - i.e. a single folder of 36 channels for you to play around with the parameters and see what you change.
# Code to load in a single image
pathToSingleImage <- "/dskh/nobackup/alexq/spicyWorkflow/inst/extdata/Ferguson_Images/ROI001_ROI 01_F3_SP16-001550_1E"
# Store images in a CytoImageList on_disk as h5 files to save memory.
singleImage <- cytomapper::loadImages(
pathToSingleImage,
single_channel = TRUE,
on_disk = TRUE,
h5FilesPath = HDF5Array::getHDF5DumpDir(),
BPPARAM = BPPARAM
)
mcols(singleImage) <- S4Vectors::DataFrame(imageID = names(singleImage))
cn <- channelNames(singleImage) # Read in channel names
cn <- sub(".*_", "", cn) # Remove preceding letters
cn <- sub(".ome", "", cn) # Remove the .ome
channelNames(singleImage) <- cn # Reassign channel names
nam <- sapply(strsplit(names(singleImage), "_"), `[`, 3)
names(singleImage) <- nam # Reassigning image names
mcols(singleImage)[["imageID"]] <- nam # Reassigning image names
Play around with these parameters!
# Write your own code!!!
singleMasks <- simpleSeg(singleImage,
nucleus = c("HH3"),
pca = TRUE,
cellBody = "dilate", # Maybe change me?
discSize = 3, # Change me!
sizeSelection = 20, # Change me!
transform = "sqrt", # Change me!
tissue = c("panCK", "CD45", "HH3")
)
plotPixels(image = singleImage["F3"],
mask = singleMasks["F3"],
img_id = "imageID",
colour_by = c("HH3"),
display = "single",
colour = list(HH3 = c("black","blue")),
legend = NULL,
bcg = list(
HH3 = c(1, 1, 2)
))
The plotPixels() function in cytomapper
makes it easy to overlay the mask on top of the nucleus intensity marker
to see how well our segmentation process has performed. Here we can see
that the segmentation appears to be performing reasonably.
plotPixels(image = images["F3"],
mask = masks["F3"],
img_id = "imageID",
colour_by = c("HH3", "CD31", "podoplanin"),
display = "single",
colour = list(HH3 = c("black","blue"),
CD31 = c("black", "red"),
podoplanin = c("black", "green") ),
legend = NULL,
bcg = list(
HH3 = c(1, 1, 2),
CD31 = c(0, 1, 2),
podoplanin = c(0, 1, 1.5)
))
In order to characterise the phenotypes of each of the segmented
cells, measureObjects() from cytomapper will
calculate the average intensity of each channel within each cell as well
as a few morphological features. By default, the
measureObjects() function will return a
SingleCellExperiment object, where the channel intensities
are stored in the counts assay and the spatial location of
each cell is stored in colData in the m.cx and
m.cy columns.
However, you can also specify measureObjects() to return
a SpatialExperiment object by specifying
return_as = "spe". As a SpatialExperiment
object, the spatial location of each cell is stored in the
spatialCoords slot, as m.cx and
m.cy, which simplifies plotting. In this demonstration, we
will return a SpatialExperiment object.
# Summarise the expression of each marker in each cell
cells <- cytomapper::measureObjects(masks,
images,
img_id = "imageID",
return_as = "spe",
BPPARAM = BPPARAM)
spatialCoordsNames(cells) <- c("x", "y")
To associate features in our image with disease progression, it is
important to read in information which links image identifiers to their
progression status. We will do this here, making sure that our
imageID match.
clinicalData <- read.csv("/dskh/nobackup/alexq/spicyWorkflow/inst/extdata/clinicalData_TMA1_2021_AF.csv")
rownames(clinicalData) <- clinicalData$imageID
clinicalData <- clinicalData[names(images), ]
# Put clinical data into SingleCellExperiment object
colData(cells) <- cbind(colData(cells), clinicalData[cells$imageID, ])
load("spe_Ferguson_2022_c.rda")
Normalisation is an extremely important step of any workflow, and we should always first visualise our marker intensities to determine of they require some sort of transformation or normalisation.
This reasons for normalisation are two-fold:
Let’s take a look at these effects in our dataset:
# Plot densities of CD3 for each image.
cells |>
join_features(features = rownames(cells), shape = "wide", assay = "counts") |>
ggplot(aes(x = CD3, colour = imageID)) +
geom_density() +
theme(legend.position = "none")
We can transform and normalise our data using the
normalizeCells function. In the
normalizeCells() function, we specify the following
parameters. transformation is an optional argument which
specifies the function to be applied to the data. We do not apply an
arcsinh transformation here, as we already apply a square root transform
in the simpleSeg() function.
method = c("trim99", "mean", PC1") is an optional argument
which specifies the normalisation method/s to be performed. Here, we: 1)
Trim the 99th percentile 2) Divide by the mean 3) Remove the 1st
principal component assayIn = "counts" is a required
argument which specifies what the assay you’ll be taking the intensity
data from is named. In our context, this is called
counts.
This modified data is then stored in the norm assay by
default. We can see that this normalised data appears more bimodal, not
perfectly, but likely to a sufficient degree for clustering, as we can
at least observe a clear CD3+ peak at 1.00, and a CD3- peak at around
0.3.
# Leave out the nuclei markers from our normalisation process.
useMarkers <- rownames(cells)[!rownames(cells) %in% c("DNA1", "DNA2", "HH3")]
# Transform and normalise the marker expression of each cell type.
cells <- normalizeCells(cells,
markers = useMarkers,
transformation = NULL,
method = c("trim99", "mean", "PC1"),
assayIn = "counts",
cores = nCores
)
# Plot densities of CD3 for each image
cells |>
join_features(features = rownames(cells), shape = "wide", assay = "norm") |>
ggplot(aes(x = CD3, colour = imageID)) +
geom_density() +
theme(legend.position = "none")
We can appreciate from the UMAP that there is a division of clusters, most likely representing different cell types. We next aim to empirically distinguish each cluster using our FuseSOM package for clustering.
Our FuseSOM R package can be found on bioconductor at https://www.bioconductor.org/packages/release/bioc/html/FuseSOM.html, and provides a pipeline for the clustering of highly multiplexed in situ imaging cytometry assays. This pipeline uses the Self Organising Map architecture coupled with Multiview hierarchical clustering and provides functions for the estimation of the number of clusters.
Here we cluster using the runFuseSOM function. We
specify the number of clusters to identify to be
numClusters = 10. We also specify a set of cell-type
specific markers to use, as we want our clusters to be distinct based
off cell type markers, rather than markers which might pick up a
transitioning cell state.
# Set seed.
set.seed(51773)
ct_markers <- c("podoplanin", "CD13", "CD31",
"panCK", "CD3", "CD4", "CD8a",
"CD20", "CD68", "CD16", "CD14", "CD66a")
# Generate SOM and cluster cells into 10 groups
cells <- runFuseSOM(
cells,
markers = ct_markers,
assay = "norm",
numClusters = 10
)
We can also observe how reasonable our choice of k = 10
was, using the estimateNumCluster() and
optiPlot() functions. Here we examine the Gap method, but
others such as Silhouette and Within Cluster Distance are also
available.
cells <- estimateNumCluster(cells, kSeq = 2:30)
optiPlot(cells, method = "gap")
We can begin the process of understanding what each of these cell
clusters are by using the plotGroupedHeatmap function from
scater. At the least, here we can see we capture all the
major immune populations that we expect to see, including the CD4 and
CD8 T cells, the CD20+ B cells, the CD68+ myeloid populations, and the
CD66+ granulocytes.
# Visualise marker expression in each cluster.
scater::plotGroupedHeatmap(
cells,
features = ct_markers,
group = "clusters",
exprs_values = "norm",
center = TRUE,
scale = TRUE,
zlim = c(-3, 3),
cluster_rows = FALSE,
block = "clusters"
)
We can then apply our labels to these cell types.
cells <- cells |>
mutate(cellType = case_when(
clusters == "cluster_1" ~ "GC",
clusters == "cluster_2" ~ "MC",
clusters == "cluster_3" ~ "OI",
clusters == "cluster_4" ~ "EP",
clusters == "cluster_5" ~ "SC",
clusters == "cluster_6" ~ "Undefined",
clusters == "cluster_7" ~ "EC",
clusters == "cluster_8" ~ "BC",
clusters == "cluster_9" ~ "TC_CD4",
clusters == "cluster_10" ~ "TC_CD8"
))
We might also be interested in how these clusters are distributed on the images themselves. Here we examine the distribution of clusters on image F3.
reducedDim(cells, "spatialCoords") <- spatialCoords(cells)
cells |>
filter(imageID == "F3") |>
plotReducedDim("spatialCoords", colour_by = "cellType")
We find it always useful to check the number of cells in each cluster. Here we can see that we have alot of squamous tumour cells and much fewer dendritic cells.
cells$cellType |>
table() |>
sort()
##
## Undefined BC MC GC TC_CD8 TC_CD4 EP OI
## 2575 3017 6167 7553 8684 11995 12261 12678
## EC SC
## 14259 76724
Another very popular method of visualising our clusters is using a UMAP. This takes very long to run and will break your R if you try to interrupt it, so probably don’t run this during the workshop.
set.seed(51773)
# Perform dimension reduction using UMAP.
cells <- scater::runUMAP(
cells,
subset_row = ct_markers,
exprs_values = "norm",
name = "normUMAP"
)
someImages <- unique(cells$imageID)[c(1, 5, 10, 20, 30, 40)]
# UMAP by imageID.
scater::plotReducedDim(
cells[, cells$imageID %in% someImages],
dimred = "normUMAP",
colour_by = "cellType"
)
load("spe_Ferguson_2022_c2.rda")
The colTest function allows us to quickly test for
associations between the proportions of the cell types and progression
status using either Wilcoxon rank sum tests or t-tests. Here we see a
p-value less than 0.05, but this does not equate to a small FDR.
# Perform simple student's t-tests on the columns of the proportion matrix.
testProp <- colTest(cells,
condition = "group",
feature = "cellType",
type = "ttest")
head(testProp)
## mean in group NP mean in group P tval.t pval adjPval cluster
## TC_CD8 0.057 0.034 2.70 0.011 0.11 TC_CD8
## TC_CD4 0.072 0.096 -2.10 0.044 0.22 TC_CD4
## Undefined 0.019 0.014 1.80 0.084 0.28 Undefined
## MC 0.038 0.047 -1.60 0.110 0.28 MC
## EC 0.099 0.085 1.50 0.140 0.28 EC
## BC 0.016 0.010 0.85 0.400 0.67 BC
Let’s examine one of these clusters using a boxplot.
prop <- getProp(cells, feature = "cellType")
clusterToUse <- rownames(testProp)[1]
prop |>
select(all_of(clusterToUse)) |>
tibble::rownames_to_column("imageID") |>
left_join(as.data.frame(colData(cells))[!duplicated(cells$imageID),], by = "imageID") |>
ggplot(aes(x = group, y = .data[[clusterToUse]], fill = group)) +
geom_boxplot()
Our spicyR package is available on bioconductor on https://www.bioconductor.org/packages/devel/bioc/html/spicyR.html
and provides a series of functions to aid in the analysis of both
immunofluorescence and imaging mass cytometry data as well as other
assays that can deeply phenotype individual cells and their spatial
location. Here we use the spicy function to test for
changes in the spatial relationships between pair-wise combinations of
cells.
Put simply, spicyR uses the L-function to determine localisation or dispersion between cell types. The L-function is an arbitrary measure of “closeness” between points, with greater values suggesting increased localisation, and lower values suggesting dispersion.
Here, we quantify spatial relationships using a combination of three
radii Rs = c(20, 50, 100) and mildly account for some
global tissue structure using sigma = 50. Further
information on how to optimise these parameters can be found in the vignette
and the spicyR paper.
spicyTest <- spicy(cells,
condition = "group",
cellTypeCol = "cellType",
imageIDCol = "imageID",
Rs = 1:10*10,
sigma = 50,
BPPARAM = BPPARAM)
topPairs(spicyTest, n = 10)
## intercept coefficient p.value adj.pvalue from
## TC_CD4__EP -17.291697 64.70955 0.02857272 0.6306348 TC_CD4
## TC_CD8__OI 2.520092 54.47818 0.02911094 0.6306348 TC_CD8
## EP__TC_CD4 -17.170495 63.77083 0.02976289 0.6306348 EP
## MC__Undefined -45.724067 70.93448 0.03858075 0.6306348 MC
## EP__BC -194.290564 192.49532 0.04171342 0.6306348 EP
## BC__EP -186.126674 192.01456 0.04293275 0.6306348 BC
## Undefined__MC -44.945367 66.67640 0.04679223 0.6306348 Undefined
## OI__TC_CD8 4.123409 50.11153 0.05211892 0.6306348 OI
## Undefined__Undefined 357.438334 -122.59698 0.05675713 0.6306348 Undefined
## OI__EP -16.606006 51.06253 0.08634602 0.7696853 OI
## to
## TC_CD4__EP EP
## TC_CD8__OI OI
## EP__TC_CD4 TC_CD4
## MC__Undefined Undefined
## EP__BC BC
## BC__EP EP
## Undefined__MC MC
## OI__TC_CD8 TC_CD8
## Undefined__Undefined Undefined
## OI__EP EP
We can visualise these tests using signifPlot where we
observe that cell type pairs appear to become less attractive (or avoid
more) in the progression sample.
# Visualise which relationships are changing the most.
signifPlot(
spicyTest,
breaks = c(-1.5, 1.5, 0.5)
)
spicyR also has functionality for plotting out
individual pairwise relationships. We can first try look into whether
the major tumour cell type localises with the major myeloid cell type,
and whether this localisation affects progression vs non-progression of
the tumour.
spicyBoxPlot(spicyTest,
from = "SC",
to = "GC")
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: The following aesthetics were dropped during statistical transformation: label.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
Our lisaClust package https://www.bioconductor.org/packages/devel/bioc/html/lisaClust.html
provides a series of functions to identify and visualise regions of
tissue where spatial associations between cell-types is similar. This
package can be used to provide a high-level summary of cell-type
co-localisation in multiplexed imaging data that has been segmented at a
single-cell resolution. Here we use the lisaClust function
to clusters cells into 5 regions with distinct spatial ordering.
set.seed(51773)
# Cluster cells into spatial regions with similar composition.
cells <- lisaClust(
cells,
k = 4,
sigma = 50,
cellType = "cellType",
BPPARAM = BPPARAM
)
## Generating local L-curves. If you run out of memory, try 'fast = FALSE'.
We can try to interpret which spatial orderings the regions are
quantifying using the regionMap function. This plots the
frequency of each cell type in a region relative to what you would
expect by chance.
# Visualise the enrichment of each cell type in each region
regionMap(cells, cellType = "cellType", limit = c(0, 2))
By default, these identified regions are stored in the
regions column in the colData of our object.
We can quickly examine the spatial arrangement of these regions using
ggplot.
cells |>
filter(imageID == "F3") |>
plotReducedDim("spatialCoords", colour_by = "region")
While much slower, we have also implemented a function for overlaying the region information as a hatching pattern so that the information can be viewed simultaneously with the cell type calls.
# Use hatching to visualise regions and cell types.
hatchingPlot(
cells,
useImages = "F3",
cellType = "cellType",
nbp = 300
)
## Concave windows are temperamental. Try choosing values of window.length > and < 1 if you have problems.
If needed, we can again quickly use the colTest function
to test for associations between the proportions of the cells in each
region and progression status using either Wilcoxon rank sum tests or
t-tests.
# Test if the proportion of each region is associated
# with progression status.
testRegion <- colTest(
cells,
feature = "region",
condition = "group",
type = "ttest"
)
testRegion
## mean in group NP mean in group P tval.t pval adjPval cluster
## region_2 0.038 0.022 1.50 0.15 0.60 region_2
## region_1 0.200 0.220 -0.63 0.53 0.77 region_1
## region_4 0.150 0.140 0.46 0.65 0.77 region_4
## region_3 0.610 0.620 -0.29 0.77 0.77 region_3
reg <- getProp(cells, feature = "region")
regionToUse <- rownames(testRegion)[1]
reg |>
select(all_of(regionToUse)) |>
tibble::rownames_to_column("imageID") |>
left_join(as.data.frame(colData(cells)), by = "imageID") |>
ggplot(aes(x = group, y = .data[[regionToUse]], fill = group)) +
geom_boxplot()
# Create list to store data.frames
data <- list()
# Add proportions of each cell type in each image
data[["Proportions"]] <- getProp(cells, "cellType")
# Add pair-wise associations
spicyMat <- bind(spicyTest)
spicyMat[is.na(spicyMat)] <- 0
spicyMat <- spicyMat |>
select(!condition) |>
tibble::column_to_rownames("imageID")
data[["SpicyR"]] <- spicyMat
# Add proportions of each region in each image
# to the list of dataframes.
data[["LisaClust"]] <- getProp(cells, "region")
# Set seed
set.seed(51773)
# Preparing outcome vector
outcome <- cells$group[!duplicated(cells$imageID)]
names(outcome) <- cells$imageID[!duplicated(cells$imageID)]
idx <- names(sample(outcome[outcome == "NP"], 14))
outcome <- outcome[!names(outcome) %in% idx]
data <- lapply(data, function(x) x[!rownames(x) %in% idx,])
# Perform cross-validation of a random forest model
# with 100 repeats of 5-fold cross-validation.
cv <- crossValidate(
measurements = data,
outcome = outcome,
classifier = "randomForest",
nFolds = 5,
nRepeats = 50,
nCores = nCores
)
Here we use the performancePlot function to assess the
AUC from each repeat of the 5-fold cross-validation. We see that the
lisaClust regions appear to capture information which is predictive of
progression status of the patients.
# Calculate AUC for each cross-validation repeat and plot.
performancePlot(
cv,
metric = "AUC",
characteristicsList = list(x = "Assay Name")
)
## Warning in .local(results, ...): AUC not found in all elements of results.
## Calculating it now.
library(grid)
samplesMetricMap(cv)
## Warning in .local(results, ...): Sample Accuracy not found in all elements of
## results. Calculating it now.
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_tile()`).
## TableGrob (2 x 1) "arrange": 2 grobs
## z cells name grob
## 1 1 (2-2,1-1) arrange gtable[layout]
## 2 2 (1-1,1-1) arrange text[GRID.text.1663]
set.seed(51773)
cv <- crossValidate(
measurements = data,
outcome = outcome,
classifier = "randomForest",
nFolds = 5,
nRepeats = 50,
multiViewMethod = "merge",
nCores = nCores
)
performancePlot(
cv,
metric = "AUC",
characteristicsList = list(x = "Assay Name")
)
## Warning in .local(results, ...): AUC not found in all elements of results.
## Calculating it now.